B.  INTRODUCTION


In  clinical  follow-up  studies,  subjects  are  monitored  at  regular  time  intervals  for  a 
physical  condition.  It  is  often  the  case  that  an  event  under  observation  can  take  place  in 
between  two  successive  visits,  and  it  may  not  be  possible  for  the  subject  to  know  the  time 
to  such  an  event  exactly.  For  example,  consider  the  situation  in  which  a  group  of  women 
at  high  risk  for  breast  cancer  is  asked  to  take  a  chemopreventive  substance  for  a  fixed  time 
period.  At  the  end  of  the  period,  each  participating  woman  is  required  to  submit  a  blood 
or  urine  sample  at  regular  intervals  in  order  to  monitor  the  level  of  a  validated  intermediate 
biomarker.  Let  X  denote  the  time  from  cessation  of  use  of  the  agent  to  the  loss  of  its 
protective  effect,  quantified  as  a  return  to  baseline  value  of  the  biomarker.  If  a  woman 
submits  a  sample  for  assay  on  a  daily  basis,  the  value  of  X  can  be  observed  exactly,  unless 
the  protective  effect  is  still  present  by  the  time  the  study  is  terminated  so  that  X  is  right 
censored  in  the  usual  sense  of  survival  analysis.  In  practice,  however,  the  follow-up  interval 
can  be  a  week  or  longer;  therefore  the  exact  value  of  X  is  generally  unknown  but  is  known  to 
lie  between  the  time  points  L  and  R,  where  L  is  the  number  of  days  from  cessation  of  agent 
intake  to  the  last  time  the  sample  was  assayed  and  the  protective  effect  was  still  present,  and 
R  is  the  number  of  days  from  cessation  of  agent  intake  to  the  most  recent  time  the  sample 
was assayed. If the protective effect is still present at that time, then R takes the value infinity. In any
case, when the value of X is known only to lie between L and R, we say that X is censored in
the interval (L, R). Therefore the observed data consist of either censoring intervals (L, R)
or exact observations X = L = R.
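The construction of a censoring interval from a visit record can be sketched as follows. This is only an illustration: the function name, the argument layout, and the convention that the protective effect is present at day 0 are assumptions, not part of the study protocol.

```python
import math

def censoring_interval(visit_days, still_protected):
    """Return the censoring interval (L, R) for one subject.

    visit_days: increasing assay days, counted from cessation of agent intake.
    still_protected: one bool per visit, True while the protective effect persists.
    """
    last_protected = 0.0  # assumption: the effect is present at day 0
    for day, protected in zip(visit_days, still_protected):
        if not protected:
            # the effect was lost between the previous assay and this one
            return (last_protected, float(day))
        last_protected = float(day)
    # the effect was still present at the last assay: right censored
    return (last_protected, math.inf)
```

For example, weekly assays with the effect first found absent on day 21 would give the interval (14, 21), while a subject whose effect persists at every assay would give (last assay day, infinity).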

Our research project is concerned with nonparametric estimation of the distribution
function $F(t) = \Pr(X \le t)$ of a real-valued random variable X, or equivalently its survival
function $S(t) = 1 - F(t)$, when the sample data are incomplete due to restricted observation
brought about by interval censoring. The generalized maximum likelihood (GML) method in
the sense of Kiefer and Wolfowitz [1] is the standard approach to estimating S. At present,
there are two iterative computation procedures that yield the GML estimate (GMLE)
of S at convergence. The first is due to Peto [2] and makes use of Newton's method.
The second is due to Turnbull [3] and makes use of a simpler but slower algorithm called the
self-consistent algorithm. A solution obtained from this algorithm is also called a self-consistent
estimator (SCE).
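The self-consistency idea can be illustrated with a minimal sketch. It is not the authors' software, and it makes several simplifying assumptions: every observation is read as X in (L, R] with L < R (exact observations are excluded), probability mass is allowed on every cell between consecutive endpoints rather than only on Turnbull's innermost intervals, and the function name and tolerances are hypothetical.

```python
import math

def self_consistent_estimate(intervals, tol=1e-10, max_iter=10_000):
    """Self-consistency (EM) iteration for interval-censored data.

    intervals: list of (L, R) with L < R; R may be math.inf.
    Returns (cells, p), where cells[j] = (lo, hi) stands for (lo, hi]
    and p[j] is the estimated probability mass of that cell.
    """
    points = sorted({e for L, R in intervals for e in (L, R)})
    cells = list(zip(points[:-1], points[1:]))
    # alpha[i][j] is True when cell j lies inside observation i
    alpha = [[L <= lo and hi <= R for lo, hi in cells] for L, R in intervals]
    n, m = len(intervals), len(cells)
    p = [1.0 / m] * m  # start from uniform mass
    for _ in range(max_iter):
        new_p = [0.0] * m
        for row in alpha:
            # mass currently assigned to the cells compatible with this observation
            denom = sum(pj for a, pj in zip(row, p) if a)
            for j, a in enumerate(row):
                if a:
                    # redistribute this observation's weight 1/n proportionally
                    new_p[j] += p[j] / (denom * n)
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return cells, new_p
        p = new_p
    return cells, p
```

Each pass spreads the weight 1/n of every observation over the cells it covers, in proportion to the current masses; a fixed point of this map is a self-consistent estimate. The survival estimate at a cell's right endpoint is one minus the cumulative mass up to that cell.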

Because  there  is  no  closed-form  expression  for  the  GMLE  of  S,  it  has  been  difficult  to 
study  its  asymptotic  statistical  properties,  including  consistency,  normality  and  efficiency. 
Such  a  setback  in  the  statistical  development  of  the  GMLE  has  severely  limited  its  use  in 
the  statistical  analysis  of  interval-censored  (IC)  data. 

Before we began our funded Army research, we had extended Efron's redistribution-to-the-right
idea for right-censored data [4] and proposed a redistribution-to-the-center (RTC)
method to yield a nonparametric estimator of S, which is called the RTC estimate (RTCE).
Such an estimator has a closed-form expression and can be readily calculated for IC data
of any dimension. IC data are said to satisfy the DI (disjoint or included) condition if, for every
two censoring intervals, either they are disjoint or one is a subset of the other. For instance,
in a clinical study in which every subject has the same follow-up schedule, say at time points
$a_1, a_2, \ldots, a_k$, then $(L, R) = (0, a_1)$, or $(a_i, a_{i+1})$, or $(a_k, \infty)$. A sample of such IC data
$(L_1, R_1), \ldots, (L_n, R_n)$ will satisfy the DI condition. We had shown that under the DI condition,
the RTCE is actually the GMLE itself. This important observation, together with the availability
of an explicit expression, had motivated us to submit the present proposal on RTCE to the
Army.
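A direct check of the DI condition on a sample can be sketched as below. The function name is hypothetical, and the censoring intervals are treated as open intervals (L, R), so two intervals that only share an endpoint count as disjoint.

```python
def satisfies_di(intervals):
    """Return True if every pair of intervals is disjoint or nested."""
    for i, (l1, r1) in enumerate(intervals):
        for l2, r2 in intervals[i + 1:]:
            disjoint = r1 <= l2 or r2 <= l1
            nested = (l1 <= l2 and r2 <= r1) or (l2 <= l1 and r1 <= r2)
            if not (disjoint or nested):
                return False
    return True
```

With a common follow-up schedule at days 7 and 14, the possible intervals (0, 7), (7, 14) and (14, infinity) pass the check, whereas two partially overlapping intervals such as (0, 10) and (5, 15) do not.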

In  our  first  year  of  research,  we  completed  our  research  for  Task  1  and  Task  2  in 
the  Statement  of  Work  for  RTCE.  However,  we  also  discovered  that  in  the  case  of  non- 
DI  data,  RTCE  may  be  different  from  GMLE,  and  RTCE  is  not  always  consistent.  The 
interesting  and  intriguing  observation  is  that  the  difference  between  RTCE  and  GMLE  is 
small, at least based on our limited simulation studies [5]. In establishing the consistency result
for RTCE under the DI condition, we had gained important insight into proofs of asymptotic
properties for GMLE, which does not possess a closed-form expression. Because GMLE is
the  preferred  estimator  for  S,  we  decided  to  focus  our  attention  on  GMLE  instead  of  RTCE 
for  the  remainder  of  the  funded  research,  and  we  have  successfully  completed  all  the  tasks 
stated  in  the  Statement  of  Work  for  GMLE. 

C.  BODY 

C.1.  Basic  setup

Interval-censored  data  can  arise  in  the  following  four  situations: 

1. Case 2 IC data (C2 data) consist of right-censored ($R = \infty$), left-censored ($L = 0$) and
strictly interval-censored observations ($0 < L < R < \infty$). These are by far the most
common  type  of  IC  data  in  clinical  follow-up  studies. 

2.  Mixed  IC  data  (MIC  data)  consist  of  both  C2  data  and  exact  observations  (L  =  R). 
Yu,  Li  and  Wong  [6]  presented  an  example  involving  MIC  data  from  a  breast  cancer 
follow-up  study. 

3. Case 1 IC data (C1 data) consist of either right-censored or left-censored observations.
For example, when an animal is sacrificed for inspection of tumor formation, the time to
appearance of the tumor is C1 interval censored. Examples of C1 data can be found in
[7] and [8].

4.  Doubly-censored  data  (DC  data)  consist  of  right-,  left-censored  and  exact  observations. 
An  example  with  DC  data  is  given  in  [9]. 
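The four data types can be told apart by which kinds of observations appear in a sample. The sketch below is a heuristic illustration only: the function names are hypothetical, left censoring is encoded as L = 0 and right censoring as R = infinity as in the list above, and the label is inferred from the observed pattern rather than from the underlying sampling model.

```python
import math

def observation_kind(L, R):
    """Classify one observation (L, R)."""
    if L == R:
        return "exact"
    if R == math.inf:
        return "right"
    if L == 0:
        return "left"
    return "interval"

def data_type(intervals):
    """Name the smallest of the four data types containing all observed kinds."""
    kinds = {observation_kind(L, R) for L, R in intervals}
    if kinds <= {"right", "left"}:
        return "C1"
    if kinds <= {"right", "left", "interval"}:
        return "C2"
    if kinds <= {"right", "left", "exact"}:
        return "DC"
    return "MIC"
```

A sample with strictly interval-censored and exact observations together is labelled MIC, since neither the C2 nor the DC pattern allows both.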

We  have  formulated  four  different  interval  censorship  models  corresponding  to  the  four 
IC  data  types.  To  study  the  asymptotic  properties  of  the  GMLE,  we  make  use  of  the 
following  assumptions: 
(AS1) The censoring distribution is discrete but the survival distribution is arbitrary.

(AS2) The support set of the censoring vector is finite, but the survival distribution is
arbitrary.

(AS3) A probability restriction. See Section C.3.

(AS4) A probability restriction. See Section C.4.

(AS5)  The  censoring  distribution  and  the  survival  distribution  are  arbitrary,  but  have 
to  satisfy  some  regularity  conditions,  stated  in  Gu  and  Zhang  [10]. 

C.2.  Case  1  model 

The Case 1 model for C1 data assumes that the survival time X and a random inspection
time Y are independent. We always observe Y. However, X is not fully observed; we know
only whether $X \le Y$ or $X > Y$. Under assumption AS1, we have shown that
GMLE  is  strongly  consistent,  asymptotically  normal  and  asymptotically  efficient  at  all  the 
inspection  times.  The  results  are  published  in  Yu,  Schick,  Li  and  Wong  [11]. 
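For Case 1 data, a well-known characterization (not stated in this report) is that the GMLE of F at the inspection times equals the isotonic regression of the indicators 1(X <= Y) on the ordered inspection times. The sketch below computes it with the pool-adjacent-violators algorithm; the function name and data layout are assumptions made for illustration.

```python
def current_status_gmle(inspection_times, event_seen):
    """Isotonic regression of the indicators 1(X <= Y) on ordered inspection times.

    Returns (times, F): the sorted distinct inspection times and the
    estimated distribution function at each of them.
    """
    data = sorted(zip(inspection_times, event_seen))
    # one block per distinct inspection time: [time, sum of indicators, count]
    blocks = []
    for y, d in data:
        if blocks and blocks[-1][0] == y:
            blocks[-1][1] += int(d)
            blocks[-1][2] += 1
        else:
            blocks.append([y, int(d), 1])
    times = [b[0] for b in blocks]
    # pooled entries: [sum of indicators, count, number of distinct times covered]
    pooled = []
    for _, s, c in blocks:
        pooled.append([s, c, 1])
        # merge while the previous mean exceeds the last mean (monotonicity violated)
        while len(pooled) > 1 and pooled[-2][0] * pooled[-1][1] > pooled[-1][0] * pooled[-2][1]:
            s2, c2, k2 = pooled.pop()
            pooled[-1][0] += s2
            pooled[-1][1] += c2
            pooled[-1][2] += k2
    F = []
    for s, c, k in pooled:
        F.extend([s / c] * k)
    return times, F
```

The survival estimate at each inspection time is one minus the returned value of F.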

C.3.  Case  2  model 

The C2 model for C2 data assumes that X and the random censoring vector (Y, Z) are
independent  and  that  Y  <  Z  with  probability  one.  We  do  not  observe  X  except  that  we 
know  X  is  before  Y,  or  between  Y  and  Z,  or  after  Z.  We  state  an  assumption  for  C2  model 
as  follows: 

(AS3) $\Pr\{X \in I_i \cap I_j\} > 0$ for any two realizations of (L, R), $(L_i, R_i) = I_i$ and $(L_j, R_j) = I_j$,
provided $I_i \cap I_j \ne \emptyset$.

Under assumption AS1, we have shown that GMLE is strongly consistent. Under
assumptions AS2 and AS3, we have shown that GMLE is asymptotically normal and
efficient. The results are published in Yu, Schick, Li and Wong [12].

C.4.  MIC  model 

Mixture  interval  censorship  (MIC)  model  for  MIC  data  assumes  that  an  IC  observation 
is  drawn  from  a  probability  mixture  of  C2  model  and  the  usual  right  censorship  model  for 
right-censored  data. 

Define $\tau = \sup\{t : \Pr(\min(X, T) < t) < 1\}$, $\tau_Y = \sup\{t : \Pr(Y < t) = 0\}$ and
$\tau_Z = \sup\{t : \Pr(Z < t) < 1\}$. We assume that $\tau > \tau_Z$. We state an assumption for the MIC
model as follows:

(AS4) $\Pr(L = \tau) > 0$ if $\Pr(X < \tau) < 1$ and $\Pr(R = \tau_Y) > 0$ if $\Pr(X < \tau_Y) > 0$.

Under  assumptions  AS2  and  AS4,  we  have  shown  that  GMLE  is  strongly  consistent 
(Yu, Li and Wong [6]), and under assumptions AS2, AS3 and AS4, GMLE is asymptotically
normal (Yu, Li and Wong [13]). Recently, we have been able to establish these asymptotic
properties without assumption AS2. A manuscript on these results has been
submitted  for  publication  (Yu,  Li  and  Wong  [14]). 

C.5.  DC  model 

The  DC  model  for  DC  data  assumes  that  X  and  a  random  vector  (Y,  Z)  are  independent 
and Y < Z with probability one, and that X is uncensored if $Y < X \le Z$, right censored if
$Z < X$ and left censored if $X \le Y$. Let $S_Z$ and $S_Y$ be the survival functions of Z and Y,
respectively, and let $K = S_Z - S_Y$. We state an assumption for the DC model as follows:


(AS5) $K(x-) > 0$ for all x such that $S(x) < 1$ and $S(x-) > 0$.


We have shown in a submitted manuscript (Yu and Wong [15]) that, in order to establish
asymptotic results, GMLE has to be modified. Under assumptions AS4 and AS5, we have
shown that the modified GMLE is strongly consistent; under assumptions AS3, AS4 and AS5,
it is also asymptotically normal and efficient.

C.6.  Two-sample  nonparametric  test

Based  on  the  asymptotic  results  that  we  have  established  for  different  IC  models,  we 
have  successively  derived  the  asymptotic  distribution  of  the  following  two-sample  distance 
test  statistics  for  each  model: 


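$$D = \int_{\tau_1}^{\tau_2} W(t) \bigl( \hat{S}_1(t) - \hat{S}_2(t) \bigr) \, dt,$$

where $\hat{S}_1$ and $\hat{S}_2$ denote the GMLEs of the survival functions of the two samples, $\tau_1$ and $\tau_2$
are specified time points and $W(t)$ is a weight function. A manuscript on
the asymptotic results of D is being submitted for publication (Wong and Yu [16]).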
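The distance statistic is a weighted integral of the difference between two estimated survival curves over a fixed time window. A minimal numerical sketch follows; it assumes the curves are right-continuous step functions, uses a midpoint rule, and does not compute the asymptotic variance or Z-score. All function and parameter names are hypothetical.

```python
from bisect import bisect_right

def step_survival(jump_times, values_after):
    """Right-continuous step function: 1 before the first jump,
    values_after[k] from jump_times[k] up to the next jump."""
    def S(t):
        k = bisect_right(jump_times, t)
        return 1.0 if k == 0 else values_after[k - 1]
    return S

def weighted_distance(S1, S2, tau1, tau2, weight=lambda t: 1.0, n_steps=10_000):
    """Midpoint-rule approximation of the integral of
    weight(t) * (S1(t) - S2(t)) over [tau1, tau2]."""
    h = (tau2 - tau1) / n_steps
    total = 0.0
    for k in range(n_steps):
        t = tau1 + (k + 0.5) * h
        total += weight(t) * (S1(t) - S2(t))
    return total * h
```

A positive value indicates that the first curve lies above the second over most of the weighted window; a test would compare the value with its estimated standard error.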


C.7.  Proportional  hazards  model 

In our original proposal, we had assigned three months of time for Task 7 on Cox regression
for IC data. However, we have realized that statistical inference for the parameter
$\beta$ in Cox regression under interval censorship is much more involved than its counterpart
in the usual right-censored situation. In the latter case, the maximum likelihood estimator
(MLE) of $\beta$ does not depend on the baseline survival function $S_0(t)$ owing to the simple nature
of the partial likelihood. However, such simplicity of the likelihood function does not carry
over to the interval censorship model, and maximum likelihood estimation of $\beta$ will involve
GML estimation of $S_0(t)$ at the same time, thus resulting in a difficult high-dimensional
estimation problem.

Under the restrictive assumption that both X and the censoring vector take on finitely
many values, we have proved that the MLE of $\beta$ and the GMLE of $S_0(t)$, and hence the
survival function $S(t \mid Z) = S_0(t)^{\exp(\beta' Z)}$, where Z denotes a vector of covariates for Cox
regression, are consistent and asymptotically normal (Li, Yu and Wong [17]). Much more
effort  is  needed  to  pursue  research  on  the  asymptotic  inference  of  Cox  regression  model 
under  more  relaxed  assumptions  on  the  distributions  of  X  and  the  censoring  vector. 
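To see why the regression parameter and the baseline survival function must be estimated together, it may help to write out the standard form of the likelihood for strictly interval-censored observations under the proportional hazards model. The display below is a sketch under assumed conventions: observation i is read as falling in $(L_i, R_i]$, $Z_i$ is its covariate vector, $S_0(0) = 1$ and $S_0(\infty) = 0$.

```latex
S(t \mid Z) = S_0(t)^{\exp(\beta^{\top} Z)}, \qquad
L(\beta, S_0) = \prod_{i=1}^{n}
  \Bigl[ S_0(L_i)^{\exp(\beta^{\top} Z_i)} - S_0(R_i)^{\exp(\beta^{\top} Z_i)} \Bigr].
```

Each factor depends on the baseline survival function at both endpoints, so no term free of the baseline can be factored out in the way the partial likelihood allows for right-censored data.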

C.8.  Computer  software 

We have made available to the public a set of computer programs for calculating
RTCE and GMLE, for carrying out asymptotic inference of GMLE for all patterns of interval
censorship, and for evaluating the Z-score of the proposed two-sample weighted distance test.
These programs can be accessed via the internet at qyu@math.binghamton.edu.

C.9.  Applications  to  breast  cancer  research 

We  have  applied  our  results  on  asymptotic  inference  of  GMLE  for  C2  model  to  two 
breast  cancer  research  projects.  The  first  project  is  concerned  with  a  chemoprevention 
intervention  trial  of  indole-3-carbinol  (I3C)  for  breast  cancer  which  is  being  conducted  at 
Strang  Cancer  Prevention  Center.  The  statistical  question  of  interest  is  the  estimation  of 
duration of sustaining effect of I3C, which is C2 censored. A preliminary report on a short-term
trial has recently been published [18]; however, a longer trial lasting for more than one
year  is  still  ongoing  so  that  more  informative  data  on  duration  of  sustaining  effect  can  be 
obtained. 

The  second  project  is  a  standard  breast  cancer  relapse  follow-up  study  based  on  data 
from  374  women  with  stages  I  -  III  unilateral  invasive  breast  cancer  surgically  treated  at 
Memorial  Sloan-Kettering  Cancer  Center  between  1985  and  1990.  The  median  follow-up 
duration  was  46  months.  Relapse  time  was  given  by  the  time  interval  between  surgery  and 
the  initial  relapse.  A  relapse  that  took  place  between  two  successive  follow-up  visits  was 
regarded as interval censored. If a patient had not relapsed by the end of the study,
then her relapse time was right censored. Of the 374 observations, 300 were right censored
(no  relapse),  21  were  left  censored  and  53  were  strictly  interval  censored  (74  relapses).  Bone 
marrow  micrometastasis  (BMM)  was  determined  for  each  woman  at  the  time  of  surgery.  An 
important  question  is  whether  remission  duration  is  related  to  the  extent  of  initial  tumor 
burden  defined  as  number  of  BMM  cells  detected.  Figure  1  compares  the  relapse-free  GMLE 
curves  of  patients  with  number  of  BMM  <  14  versus  those  with  number  of  BMM  >  14.  Our 
asymptotic  two-sample  distance  test  yielded  a  P  value  close  to  0.1.  An  abstract  on  a  detailed 
prognostic analysis of the entire data set using our asymptotic results on C2 data has been
accepted  for  presentation  at  the  annual  San  Antonio  Breast  Cancer  Symposium  in  December 
1998. 
D.  CONCLUSIONS

In the four years of our Army grant, we have successfully completed our research objectives
on the asymptotic inference of the GMLE of the survival function under different
interval censorship models, including consistency, asymptotic normality and asymptotic
efficiency. The results that we have established provide clinicians and basic science researchers
in breast cancer with a set of fundamentally important statistical tools for the
analysis of all types of interval-censored data (C2, MIC, DC and C1 data) that are encountered
in breast cancer research. We have also made available to the general public a
set of computer programs for carrying out the asymptotic generalized maximum likelihood
inference procedure for all types of interval-censored data.